Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining
targeted to this task, that is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas.
Comment: To appear in ECCV 2018.
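A minimal sketch of the kind of cross-modal self-supervision described above, assuming a simple triplet-style hinge loss and a hand-tuned curriculum over hard negatives; the encoders, batch construction, margin, and schedule are illustrative assumptions rather than the paper's settings:

```python
# Sketch only: self-supervised face-voice pairing with curriculum hard negative mining.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps an input feature vector to a unit-norm embedding."""
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

face_enc, voice_enc = Encoder(512), Encoder(40)   # assumed face / log-mel feature sizes

def contrastive_loss(f, v, hard_fraction, margin=0.6):
    """Positives are (face_i, voice_i) from the same video track; negatives come from
    other items in the batch, keeping only the `hard_fraction` closest ones."""
    pos = (f - v).pow(2).sum(dim=1)                      # distances of matched pairs
    dist = torch.cdist(f, v)                             # all face-voice distances
    mask = ~torch.eye(len(f), dtype=torch.bool)          # exclude the positives
    neg = dist[mask].view(len(f), -1)
    k = max(1, int(hard_fraction * neg.size(1)))
    hard_neg, _ = neg.topk(k, largest=False)             # hardest = smallest distance
    return (pos.unsqueeze(1) - hard_neg + margin).clamp(min=0).mean()

# Toy batch: face and voice features from the same 32 talking-face tracks.
faces, voices = torch.randn(32, 512), torch.randn(32, 40)
for step in range(3):
    hard_fraction = min(1.0, 0.25 * (step + 1))          # curriculum: harder negatives over time
    loss = contrastive_loss(face_enc(faces), voice_enc(voices), hard_fraction)
    loss.backward()
```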
From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script
The goal of this paper is the automatic identification of characters in TV
and feature film material. In contrast to standard approaches to this task,
which rely on the weak supervision afforded by transcripts and subtitles, we
propose a new method requiring only a cast list. This list is used to obtain
images of actors from freely available sources on the web, providing a form of
partial supervision for this task. In using images of actors to recognize
characters, we make the following three contributions: (i) We demonstrate that
an automated semi-supervised learning approach is able to adapt from the
actor's face to the character's face, including the face context of the hair;
(ii) By building voice models for every character, we provide a bridge between
frontal faces (for which there is plenty of actor-level supervision) and
profile (for which there is very little or none); and (iii) by combining face
context and speaker identification, we are able to identify characters with
partially occluded faces and extreme facial poses. Results are presented on the
TV series 'Sherlock' and the feature film 'Casablanca'. We achieve the
state-of-the-art on the Casablanca benchmark, surpassing previous methods that
have used the stronger supervision available from transcripts.
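As a rough illustration of combining face-context and speaker cues, the toy fusion below weights per-character scores from a face model and a voice model; the fusion weight, character list, and score values are hypothetical, not taken from the paper:

```python
# Illustrative sketch: fuse per-character posteriors from face and voice models.
import numpy as np

def fuse_scores(face_scores, voice_scores, alpha=0.7):
    """Weighted combination of per-character posteriors for one track.
    face_scores / voice_scores: arrays of shape (num_characters,) summing to 1."""
    fused = alpha * face_scores + (1.0 - alpha) * voice_scores
    return fused / fused.sum()

characters = ["Sherlock", "Watson", "Mycroft"]
face_p  = np.array([0.20, 0.45, 0.35])   # face heavily occluded -> uncertain
voice_p = np.array([0.80, 0.10, 0.10])   # speaker model is confident
print(characters[int(np.argmax(fuse_scores(face_p, voice_p)))])  # -> "Sherlock"
```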
VoxCeleb2: Deep Speaker Recognition
The objective of this paper is speaker recognition under noisy and
unconstrained conditions.
We make two key contributions. First, we introduce a very large-scale
audio-visual speaker recognition dataset collected from open-source media.
Using a fully automated pipeline, we curate VoxCeleb2 which contains over a
million utterances from over 6,000 speakers. This is several times larger than
any publicly available speaker recognition dataset.
Second, we develop and compare Convolutional Neural Network (CNN) models and
training strategies that can effectively recognise identities from voice under
various conditions. The models trained on the VoxCeleb2 dataset surpass the
performance of previous works on a benchmark dataset by a significant margin.
Comment: To appear in Interspeech 2018. The audio-visual dataset can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2. 1806.05622v2: minor fixes; 5 pages.
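A hedged sketch of the general recipe rather than the VoxCeleb2 architecture: a small CNN maps a log-mel spectrogram to a fixed-length speaker embedding and is trained with a classification head over speaker identities. Layer sizes and input shapes here are assumptions for illustration:

```python
# Sketch: CNN speaker embedding trained by speaker classification.
import torch
import torch.nn as nn

class SpeakerCNN(nn.Module):
    def __init__(self, n_speakers, emb_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                       # pool over time and frequency
        )
        self.embed = nn.Linear(64, emb_dim)
        self.classifier = nn.Linear(emb_dim, n_speakers)   # used only during training

    def forward(self, spec):
        # spec: (batch, 1, n_mels, n_frames) log-mel spectrogram
        h = self.conv(spec).flatten(1)
        emb = self.embed(h)
        return emb, self.classifier(emb)

model = SpeakerCNN(n_speakers=6000)
emb, logits = model(torch.randn(4, 1, 40, 300))            # 4 utterances, 40 mel bins
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 6000, (4,)))
```

At test time only the embedding branch would be used, comparing utterances by distance between embeddings.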
Seeing Voices and Hearing Faces: Cross-modal biometric matching
We introduce a seemingly impossible task: given only an audio clip of someone
speaking, decide which of two face images is the speaker. In this paper we
study this, and a number of related cross-modal tasks, aimed at answering the
question: how much can we infer from the voice about the face and vice versa?
We study this task "in the wild", employing the datasets that are now publicly
available for face recognition from static images (VGGFace) and speaker
identification from audio (VoxCeleb). These provide training and testing
scenarios for both static and dynamic testing of cross-modal matching. We make
the following contributions: (i) we introduce CNN architectures for both binary
and multi-way cross-modal face and audio matching, (ii) we compare dynamic
testing (where video information is available, but the audio is not from the
same video) with static testing (where only a single still image is available),
and (iii) we use human testing as a baseline to calibrate the difficulty of the
task. We show that a CNN can indeed be trained to solve this task in both the
static and dynamic scenarios, and is even well above chance on 10-way
classification of the face given the voice. The CNN matches human performance
on easy examples (e.g. different gender across faces) but exceeds human
performance on more challenging examples (e.g. faces with the same gender, age
and nationality).
Comment: To appear in: IEEE Computer Vision and Pattern Recognition (CVPR), 2018.
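The binary formulation can be sketched as follows, under assumed feature dimensions and a simple dot-product scoring head (not the paper's exact CNN architecture): embed the voice and both candidate faces, then softmax the two voice-face similarities into a choice:

```python
# Sketch: "which of two faces is speaking?" as a two-way cross-modal matching problem.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryMatcher(nn.Module):
    def __init__(self, voice_dim=40, face_dim=512, emb_dim=128):
        super().__init__()
        self.voice = nn.Linear(voice_dim, emb_dim)
        self.face = nn.Linear(face_dim, emb_dim)

    def forward(self, voice_feat, face_a, face_b):
        v = F.normalize(self.voice(voice_feat), dim=-1)
        fa = F.normalize(self.face(face_a), dim=-1)
        fb = F.normalize(self.face(face_b), dim=-1)
        # similarity of the voice to each candidate face, softmaxed into a choice
        logits = torch.stack([(v * fa).sum(-1), (v * fb).sum(-1)], dim=-1)
        return logits.softmax(dim=-1)                      # P(face A) vs P(face B)

matcher = BinaryMatcher()
probs = matcher(torch.randn(8, 40), torch.randn(8, 512), torch.randn(8, 512))
print(probs.argmax(dim=-1))                                # 0 -> face A, 1 -> face B
```

The same scoring extends to the N-way setting by stacking similarities to N candidate faces instead of two.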
Disentangled Speech Embeddings using Cross-modal Self-supervision
The objective of this paper is to learn representations of speaker identity
without access to manually annotated data. To do so, we develop a
self-supervised learning objective that exploits the natural cross-modal
synchrony between faces and audio in video. The key idea behind our approach is
to tease apart--without annotation--the representations of linguistic content
and speaker identity. We construct a two-stream architecture which: (1) shares
low-level features common to both representations; and (2) provides a natural
mechanism for explicitly disentangling these factors, offering the potential
for greater generalisation to novel combinations of content and identity and
ultimately producing speaker identity representations that are more robust. We
train our method on a large-scale audio-visual dataset of talking heads `in the
wild', and demonstrate its efficacy by evaluating the learned speaker
representations for standard speaker recognition performance.
Comment: ICASSP 2020. The first three authors contributed equally to this work.
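A minimal sketch of a two-stream layout in the spirit described above, with assumed shapes and layers rather than the paper's architecture: a shared trunk over the audio, a per-frame head for linguistic content, and a time-pooled head for speaker identity:

```python
# Sketch: shared low-level audio features split into content and identity streams.
import torch
import torch.nn as nn

class TwoStreamAudio(nn.Module):
    def __init__(self, n_mels=40, content_dim=128, identity_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(                        # shared low-level features
            nn.Conv1d(n_mels, 128, 5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, 5, padding=2), nn.ReLU(),
        )
        self.content_head = nn.Conv1d(128, content_dim, 1)    # per-frame linguistic content
        self.identity_head = nn.Linear(128, identity_dim)     # one vector per clip

    def forward(self, spec):
        # spec: (batch, n_mels, n_frames)
        h = self.trunk(spec)
        content = self.content_head(h)                     # (batch, content_dim, n_frames)
        identity = self.identity_head(h.mean(dim=2))       # pooled over time
        return content, identity

content, identity = TwoStreamAudio()(torch.randn(2, 40, 200))
```

In this layout, the content stream can be supervised by fine-grained audio-visual synchrony while the identity stream is supervised at the clip level, which is what encourages the two factors to separate.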
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
The rapid growth of video on the internet has made searching for video
content using natural language queries a significant challenge. Human-generated
queries for video datasets `in the wild' vary a lot in terms of degree of
specificity, with some queries describing specific details such as the names of
famous identities, content from speech, or text available on the screen. Our
goal is to condense the multi-modal, extremely high dimensional information
from videos into a single, compact video representation for the task of video
retrieval using free-form text queries, where the degree of specificity is
open-ended.
For this we exploit existing knowledge in the form of pre-trained semantic
embeddings which include 'general' features such as motion, appearance, and
scene features from visual content. We also explore the use of more 'specific'
cues from ASR and OCR which are intermittently available for videos and find
that these signals remain challenging to use effectively for retrieval. We
propose a collaborative experts model to aggregate information from these
different pre-trained experts and assess our approach empirically on five
retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet. Code and
data can be found at www.robots.ox.ac.uk/~vgg/research/collaborative-experts/.
This paper contains a correction to results reported in the previous version.
Comment: This update contains a correction to previously reported results.
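For illustration only (not the released code at the URL above), the sketch below aggregates several pre-trained expert features for a video into one compact vector and ranks videos against a text-query embedding by cosine similarity; the expert set, dimensions, and the simple concatenate-and-project aggregation are assumptions:

```python
# Sketch: project pre-trained expert features into one video vector for text-video retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

EXPERT_DIMS = {"appearance": 2048, "motion": 1024, "scene": 2208, "asr": 768, "ocr": 768}

class VideoAggregator(nn.Module):
    def __init__(self, expert_dims, out_dim=512):
        super().__init__()
        self.keys = list(expert_dims)                          # fixed expert ordering
        self.proj = nn.Linear(sum(expert_dims.values()), out_dim)

    def forward(self, experts):
        # experts: dict name -> (batch, dim); intermittently available experts
        # (ASR/OCR) can be passed as zero vectors when missing for a video.
        x = torch.cat([experts[k] for k in self.keys], dim=-1)
        return F.normalize(self.proj(x), dim=-1)

agg = VideoAggregator(EXPERT_DIMS)
videos = agg({k: torch.randn(10, d) for k, d in EXPERT_DIMS.items()})   # 10 candidate videos
query = F.normalize(torch.randn(1, 512), dim=-1)                        # text-query embedding
ranking = (videos @ query.t()).squeeze(1).argsort(descending=True)      # best match first
```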